Notes

Question 1: Data frame, subsetting, function, and file I/O

Read in the states.txt file into a data frame as described.

  1. Use logical subsetting to extract a numeric vector called murder_lowincome containing murder rates for just those states with per capita incomes (the 3rd column) less than the median per capita income. Similarly, extract a vector called murder_highincome containing murder rates for just those states with greater than (or equal to) the median per capita income. Run a two-sample t.test() to determine whether the mean murder rates are different between these two groups.
# Read table
states <- read.table("states.txt", header = TRUE, sep = "\t")
# Calculate the median income
median_income <- median(states$income)
# Calculate the murder rates for just those states with per capita incomes less and greater than the median per capita income
murder_lowincome <- states$murder[states$income < median_income]
murder_highincome <- states$murder[states$income >= median_income]
# Run t-test
t.test(murder_lowincome, murder_highincome)
## 
##  Welch Two Sample t-test
## 
## data:  murder_lowincome and murder_highincome
## t = 1.2364, df = 46.494, p-value = 0.2225
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8058706  3.3738706
## sample estimates:
## mean of x mean of y 
##     8.020     6.736

Solution: The p-value is 0.2225 (larger than 0.05), do not reject the H0 hypothesis. There is no significant evidence to indicate that the mean murder rates are different between these two groups.

  1. Use syntax of [<row_selector>, <column_selector>] to extract a new data frame called states_name_pop containing only the columns for name and population. Extract another data frame called states_gradincome_high containing all columns, and all rows where the income is greater than the median of income or where hs_grad is greater than the median of hs_grad. (There are 35 such states listed in the table.)
# Extract the column `name` and `population`
states_name_pop <- states[, c("name", "population")]
# Extract states with above median withdrawal income or high school graduation rates
median_income <- median(states$income)
median_hs_grad <- median(states$hs_grad)
states_gradincome_high <- states[states$income > median_income | states$hs_grad > median_hs_grad, ]
  1. Create a table called states_by_region_income, where the rows are ordered first by region and second by income. (Hint: You may want to use the order() function here. The order() function can take multiple parameters as in order(vec1, vec2), which considered in turn in determining the ordering. See more examples from here
# Sorted by region and income
states_by_region_income <- states[order(states$region, states$income), ]
  1. Write a function called normalize_mean_sd() that takes such a vector and returns the normalized version. The function should work even if any values are NA (the normalized version of NA should simply be NA). To “normalize” a vector of numbers, first subtract the mean from each number and then divide each by the standard deviation of the numbers. Use vector like sample <- c(3.2, 5.1, 2.4, 1.6, NA, 7.9) to test your function.
# Normalisation
normalize_mean_sd <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x <- sd(x, na.rm = TRUE)
  (x - mean_x) / sd_x
}
sample <- c(3.2, 5.1, 2.4, 1.6, NA, 7.9)
normalized_sample <- normalize_mean_sd(sample)
normalized_sample
## [1] -0.3335277  0.4208802 -0.6511732 -0.9688186         NA  1.5326393
  1. Use apply() or another function from the apply family, together with the function normalize_mean_sd() that you created, to normalize the data columns in states.txt. Then write the new table to a new file called states_normalized.txt.
# Normalize the data columns `population`, `income`, `murder` and `hs_grad`
normalized_data  <- as.data.frame(lapply(states[, c("population", "income", "murder", "hs_grad")], normalize_mean_sd))
states_normalized <- cbind(states[, c("name", "region")], as.data.frame(normalized_data))
# Write to new table
write.table(states_normalized, "states_normalized.txt", sep = "\t", row.names = FALSE)
head(states_normalized)

Question 2: R Plotting

The data file ozone.csv was obtained from the supplementary data of Biostatistics: A Methodology for the Health Sciences, describing the weather conditions in New York City in 1973. Full description available here.

  1. Make the scatter plot of Solar Radiation against Ozone, the histogram of Wind Speed and the Boxplot of Ozone level per month. They should look as follows. Hint: For the histogram, see the breaks and freq arguments to create 20 bins and display density rather than frequency. For the box plot, try the rainbow function; the colors are not necessary the same as in the figure below. The las argument changes the label orientation. See ?par. Look at the arguments to boxplot to see how to change the names printed under each box.
# Read the dataset Ozone
ozone <- read.csv("ozone.csv")
par(mfrow = c(1, 3))
# Scatter plot of Solar Radiation against Ozone
plot(ozone$Ozone, ozone$Solar.R, main = "Relationship between\nsolar radiation and ozone level", 
     xlab = "Ozone level", ylab = "Solar Radiation", pch = 16, col = "#fea400")
# Histogram of Wind Speed
hist(ozone$Wind, breaks = 20, freq = FALSE, main = "Distribution of Wind speed", 
     xlab = "Wind Speed", col = "#9f1fef")
# Boxplot of Ozone level per month
ozone$Month <- factor(ozone$Month, levels = 5:9, labels = c("May", "Jun", "Jul", "Aug", "Sep"))
boxplot(ozone$Ozone ~ ozone$Month, xlab = "Month", ylab = "Ozone", main = "Distribution of Ozone per-month",
        col = c("#fe0000", "#cbfe00", "#00fe65", "#0065fe", "#9f1fef"), las = 2)

  1. Create a layout with three columns with par function. Then plot Ozone versus Solar Radiation, Wind Speed and Temperature on separate graphs. Use different colors and plotting characters on each plot. At last, save the plot to a pdf. HINT: Create the graph first in RStudio. When you’re happy with it, re-run the code preceded by the pdf function to save to a file. Don’t forget to use dev.off() to close the file.
par(mfrow = c(1, 3))
# Scatter plot of Ozone against Solar Radiation
plot(ozone$Solar.R, ozone$Ozone, main = "Relationship between\nsolar radiation and ozone level", 
     xlab = "Solar Radiation", ylab = "Ozone level", pch = 16, col = "#8fed8f")
# Scatter plot of Ozone against Wind Speed
plot(ozone$Wind, ozone$Ozone, main = "Relationship between\nwind speed and ozone level", 
     xlab = "Wind Speed", ylab = "Ozone level", pch = 15, col = "#4581b3")
# Scatter plot of Ozone against Temperature
plot(ozone$Temp, ozone$Ozone, main = "Relationship between\ntemperature and ozone level", 
     xlab = "Temperature", ylab = "Ozone level", pch = 17, col = "#fea400")

# Save to pdf
dev.copy(pdf, "ozone_plots.pdf", width = 10, height = 3)
## pdf 
##   3
dev.off()
## png 
##   2
  1. Temperature and Ozone level seem to be correlated. However, there are some observations that do not seem to fit the trend, especially those with Ozone level > 100. Modify the plot so that these outlier observations are in a different colour. Add a legend to help interpret the plot. You can break down the problem into the following steps
# Find the outliers
outliers <- ozone$Ozone > 100
# Create a blank plot
plot(0, 0, type = "n", xlim = range(ozone$Temp, na.rm = TRUE), ylim = range(ozone$Ozone, na.rm = TRUE), 
     xlab = "Temperature", ylab = "Ozone level", main = "Relationship between temperature and ozone level")
# Plot the outliers
points(ozone$Temp[outliers], ozone$Ozone[outliers], pch = 17, col = "#fe0000")
# Plot the non-outliers
points(ozone$Temp[!outliers], ozone$Ozone[!outliers], pch = 17, col = "#fea400")
# Add a legend
legend("topleft", legend = c("Ozone > 100", "Normal Ozone"), col = c("#fe0000", "#fea400"), pch = 17)
# Add a dividing line
abline(h = 100, lty = 2)

  1. Create the figures in question (b) again using ggplot2.
library(ggplot2)
library(gridExtra)
# Scatter plot of Ozone against Solar Radiation
p1 <- ggplot(ozone, aes(x = Solar.R, y = Ozone)) + 
  geom_point(pch = 16, color = "#8fed8f") + 
  labs(x = "Solar Radiation", y = "Ozone level") + 
  theme_bw()
# Scatter plot of Ozone against Wind Speed
p2 <- ggplot(ozone, aes(x = Wind, y = Ozone)) + 
  geom_point(pch = 15, color = "#4581b3") + 
  labs(x = "Wind Speed", y = "Ozone level") + 
  theme_bw()
# Scatter plot of Ozone against Temperature
p3 <- ggplot(ozone, aes(x = Temp, y = Ozone)) + 
  geom_point(pch = 17, color = "#fea400") + 
  labs(x = "Temperature", y = "Ozone level") + 
  theme_bw()
plot_grid <- grid.arrange(p1, p2, p3, ncol = 3, nrow = 1)

# Save to pdf
ggsave("ozone_plots_ggplot2.pdf", plot = plot_grid, device = "pdf", width = 10, height = 3)

Question 3: Hypothesis testing

The gene expression data collected by Golub et al. (1999) are among the classical in bioinformatics. The data are stored in golub.txt, containing gene expression values of 3051 genes (rows) from 38 leukemia patients (columns). Twenty-seven patients (column 1 to 27) are diagnosed as acute lymphoblastic leukemia (ALL) and eleven (column 28 to 38) as acute myeloid leukemia (AML). The tumor class of ALL is 0 (negative), while the tumor class of AML is 1 (positive).

The important gene CD33 is among one of the investigated genes. It has its expression values in row 808 of the Golub data. Suppose that normality of the ALL and AML expression values has been validated and assume equal variance. Test the equality of the means by an appropriate test about gene CD33. Formulate the null hypothesis, the p-value and your conclusion.

# Read the dataset
golub <- read.table("golub.txt", header = TRUE)
# Extract the expression values of gene CD33
irow = 808
cd33_all <- golub[irow, 1:27]
cd33_aml <- golub[irow, 28:38]
# Perform t-test
t.test(cd33_all, cd33_aml, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  cd33_all and cd33_aml
## t = -7.9813, df = 36, p-value = 1.773e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.5487898 -0.9211602
## sample estimates:
##  mean of x  mean of y 
## -0.8812041  0.3537709

Solution:

Conclusion: The p-value is 1.773e-09 (smaller than 0.05), reject the H0 hypothesis. The means of the gene CD33 expression values in lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) patients are significantly different.

Question 4: Categorical Analysis

Researchers conducted a study to investigate the relationship between a specific genetic mutation (A) and susceptibility to a rare neurological disorder (N). They collected data from a cohort of 150 individuals and summarized it in a 2x2 contingency table as follows:

                    Disorder (N)
                Present     Absent 
                ---------   ------
Mutation (A)        25      60 
No Mutation (A)     30      35

Perform a categorical analysis on this data using R and determine whether there is a significant association between the genetic mutation (A) and the occurrence of the disease (D) at a significance level of 0.05. Clearly state your steps of hypothesis testing to get full marks.

# Create the contingency table
table <- matrix(c(25, 60, 30, 35), nrow = 2, byrow = TRUE)
dimnames(table) <- list(Mutation = c("A", "No A"), Disorder = c("Present", "Absent"))
# Perform the chi-square test
chisq.test(table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 3.7541, df = 1, p-value = 0.05268

Solution: The p-value is 0.05268 (larger than 0.05), do not reject the H0 hypothesis. There is no significant association between the genetic mutation (A) and the occurrence of the disease (D) at a significance level of 0.05.

//

sessionInfo

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C                               
## [5] LC_TIME=Chinese (Simplified)_China.utf8    
## 
## time zone: Asia/Hong_Kong
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gridExtra_2.3 ggplot2_3.5.1
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5       cli_3.6.3         knitr_1.49        rlang_1.1.4      
##  [5] xfun_0.48         generics_0.1.3    textshaping_0.4.0 jsonlite_1.8.9   
##  [9] labeling_0.4.3    glue_1.8.0        colorspace_2.1-1  htmltools_0.5.8.1
## [13] ragg_1.3.3        sass_0.4.9        fansi_1.0.6       scales_1.3.0     
## [17] rmarkdown_2.29    grid_4.4.1        evaluate_1.0.1    munsell_0.5.1    
## [21] jquerylib_0.1.4   tibble_3.2.1      fastmap_1.2.0     yaml_2.3.10      
## [25] lifecycle_1.0.4   compiler_4.4.1    dplyr_1.1.4       pkgconfig_2.0.3  
## [29] rstudioapi_0.17.1 systemfonts_1.1.0 farver_2.1.2      digest_0.6.37    
## [33] R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4        pillar_1.9.0     
## [37] magrittr_2.0.3    bslib_0.8.0       withr_3.0.2       tools_4.4.1      
## [41] gtable_0.3.6      cachem_1.1.0